We are going to study on the factors that would affect the accommodation price on Airbnb in Brussels, and make a prediction on the total cost for 2 people staying 4 nights using proper model.

First of all, we performed exploratory data analysis to get an overall understanding of the data and examine the data in more detail. Next, we did some mappings to show our key insights. Then we construct the models, analyse the results and wrap it up in our conclusion.

1 EDA

Before we start, we need to import the data and conduct some basic exploratory data analysis.

1.1 Raw Data

#Import data
listings_csv <- vroom::vroom("http://data.insideairbnb.com/belgium/bru/brussels/2020-06-15/data/listings.csv.gz")

#save as another dataset
listings <- listings_csv

#take a glimpse of all the data
glimpse(listings)
## Rows: 8,986
## Columns: 106
## $ id                                           <dbl> 2352, 2354, 45145, 481...
## $ listing_url                                  <chr> "https://www.airbnb.co...
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped                                 <date> 2020-06-16, 2020-06-1...
## $ name                                         <chr> "Triplex-2chmbrs,grand...
## $ summary                                      <chr> "Cute 2 bedrooms appar...
## $ space                                        <chr> "ENGLISH VERSION: The ...
## $ description                                  <chr> "Cute 2 bedrooms appar...
## $ experiences_offered                          <chr> "none", "none", "none"...
## $ neighborhood_overview                        <chr> "Basilique Koekelberg,...
## $ notes                                        <chr> "Les événements et les...
## $ transit                                      <chr> "bus 49, tram 82 direc...
## $ access                                       <chr> "easy from the ring by...
## $ interaction                                  <chr> NA, NA, NA, NA, "If yo...
## $ house_rules                                  <chr> "CHECK-IN. Regular che...
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url                                  <chr> "https://a0.muscache.c...
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id                                      <dbl> 2582, 2582, 199370, 21...
## $ host_url                                     <chr> "https://www.airbnb.co...
## $ host_name                                    <chr> "Oda", "Oda", "Erick",...
## $ host_since                                   <date> 2008-08-28, 2008-08-2...
## $ host_location                                <chr> "Belgium", "Belgium", ...
## $ host_about                                   <chr> "Hi there! I've been a...
## $ host_response_time                           <chr> "within an hour", "wit...
## $ host_response_rate                           <chr> "100%", "100%", "N/A",...
## $ host_acceptance_rate                         <chr> "N/A", "N/A", "N/A", "...
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url                           <chr> "https://a0.muscache.c...
## $ host_picture_url                             <chr> "https://a0.muscache.c...
## $ host_neighbourhood                           <chr> "Molenbeek-Saint-Jean"...
## $ host_listings_count                          <dbl> 3, 3, 2, 1, 1, 1, 16, ...
## $ host_total_listings_count                    <dbl> 3, 3, 2, 1, 1, 1, 16, ...
## $ host_verifications                           <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified                       <lgl> FALSE, FALSE, TRUE, FA...
## $ street                                       <chr> "Sint-Jans-Molenbeek, ...
## $ neighbourhood                                <chr> "Molenbeek-Saint-Jean"...
## $ neighbourhood_cleansed                       <chr> "Molenbeek-Saint-Jean"...
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA...
## $ city                                         <chr> "Sint-Jans-Molenbeek",...
## $ state                                        <chr> "Brussels", "Region Br...
## $ zipcode                                      <dbl> 1080, 1080, 1210, 1150...
## $ market                                       <chr> "Brussels", "Brussels"...
## $ smart_location                               <chr> "Sint-Jans-Molenbeek, ...
## $ country_code                                 <chr> "BE", "BE", "BE", "BE"...
## $ country                                      <chr> "Belgium", "Belgium", ...
## $ latitude                                     <dbl> 50.9, 50.9, 50.9, 50.8...
## $ longitude                                    <dbl> 4.31, 4.31, 4.37, 4.41...
## $ is_location_exact                            <lgl> TRUE, TRUE, FALSE, TRU...
## $ property_type                                <chr> "Apartment", "Apartmen...
## $ room_type                                    <chr> "Entire home/apt", "En...
## $ accommodates                                 <dbl> 5, 4, 2, 2, 3, 2, 3, 3...
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 1.5, 1....
## $ bedrooms                                     <dbl> 2, 1, 1, 2, 1, 1, 0, 0...
## $ beds                                         <dbl> 2, 1, 1, 2, 1, 1, 2, 2...
## $ bed_type                                     <chr> "Real Bed", "Real Bed"...
## $ amenities                                    <chr> "{TV,\"Cable TV\",Inte...
## $ square_feet                                  <lgl> NA, FALSE, NA, NA, NA,...
## $ price                                        <chr> "$93.00", "$78.00", "$...
## $ weekly_price                                 <chr> "$625.00", "$546.00", ...
## $ monthly_price                                <chr> "$1,800.00", "$1,495.0...
## $ security_deposit                             <chr> "$150.00", "$150.00", ...
## $ cleaning_fee                                 <chr> "$65.00", "$65.00", "$...
## $ guests_included                              <dbl> 4, 1, 1, 1, 2, 2, 2, 2...
## $ extra_people                                 <chr> "$5.00", "$0.00", "$30...
## $ minimum_nights                               <dbl> 2, 2, 1, 2, 5, 2, 1, 1...
## $ maximum_nights                               <dbl> 365, 365, 730, 14, 120...
## $ minimum_minimum_nights                       <dbl> 2, 2, 2, 2, 5, 2, 1, 1...
## $ maximum_minimum_nights                       <dbl> 2, 2, 4, 2, 5, 2, 1, 1...
## $ minimum_maximum_nights                       <dbl> 1125, 1125, 730, 14, 1...
## $ maximum_maximum_nights                       <dbl> 1125, 1125, 730, 14, 1...
## $ minimum_nights_avg_ntm                       <dbl> 2, 2, 2, 2, 5, 2, 1, 1...
## $ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 730, 14, 1...
## $ calendar_updated                             <chr> "2 months ago", "2 mon...
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30                              <dbl> 29, 28, 29, 30, 26, 0,...
## $ availability_60                              <dbl> 57, 58, 56, 60, 56, 0,...
## $ availability_90                              <dbl> 87, 88, 64, 90, 86, 0,...
## $ availability_365                             <dbl> 267, 363, 338, 365, 36...
## $ calendar_last_scraped                        <date> 2020-06-16, 2020-06-1...
## $ number_of_reviews                            <dbl> 15, 2, 3, 0, 105, 12, ...
## $ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 11, 4, 0, ...
## $ first_review                                 <date> 2014-04-19, 2016-04-2...
## $ last_review                                  <date> 2018-12-31, 2018-10-2...
## $ review_scores_rating                         <dbl> 87, 80, 100, NA, 97, 9...
## $ review_scores_accuracy                       <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_cleanliness                    <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_checkin                        <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_communication                  <dbl> 9, 10, 8, NA, 10, 10, ...
## $ review_scores_location                       <dbl> 8, 10, 10, NA, 10, 10,...
## $ review_scores_value                          <dbl> 9, 10, 8, NA, 9, 10, 7...
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, F...
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable                             <lgl> TRUE, TRUE, TRUE, FALS...
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy                          <chr> "strict_14_with_grace_...
## $ require_guest_profile_picture                <lgl> TRUE, TRUE, FALSE, FAL...
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count               <dbl> 2, 2, 2, 1, 1, 1, 16, ...
## $ calculated_host_listings_count_entire_homes  <dbl> 2, 2, 0, 1, 1, 1, 16, ...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0...
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month                            <dbl> 0.20, 0.04, 0.03, NA, ...
#summary statistics
skimr::skim(listings)
Data summary
Name listings
Number of rows 8986
Number of columns 106
_______________________
Column type frequency:
character 45
Date 5
logical 17
numeric 39
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 8986 0
name 12 1.00 1 217 0 8815 0
summary 409 0.95 1 1000 0 8056 0
space 3216 0.64 1 1000 0 5374 0
description 291 0.97 1 1000 0 8431 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 3416 0.62 1 1000 0 4758 0
notes 6163 0.31 1 1000 0 2291 0
transit 3219 0.64 1 1000 0 4950 0
access 4864 0.46 1 1000 0 3416 0
interaction 4336 0.52 1 1000 0 3797 0
house_rules 4852 0.46 1 1000 0 3323 0
picture_url 0 1.00 81 146 0 8796 0
host_url 0 1.00 38 43 0 6233 0
host_name 4 1.00 1 28 0 2888 0
host_location 30 1.00 2 70 0 616 0
host_about 4714 0.48 1 6003 0 2683 11
host_response_time 4 1.00 3 18 0 5 0
host_response_rate 4 1.00 2 4 0 44 0
host_acceptance_rate 4 1.00 2 4 0 99 0
host_thumbnail_url 4 1.00 55 106 0 6211 0
host_picture_url 4 1.00 57 109 0 6211 0
host_neighbourhood 1814 0.80 5 30 0 88 0
host_verifications 0 1.00 2 170 0 210 0
street 0 1.00 11 71 0 191 0
neighbourhood 0 1.00 5 21 0 49 0
neighbourhood_cleansed 0 1.00 5 21 0 19 0
city 71 0.99 3 22 0 65 0
state 104 0.99 4 30 0 42 0
market 12 1.00 5 21 0 6 0
smart_location 0 1.00 11 39 0 78 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
property_type 0 1.00 4 22 0 26 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1552 0 8421 0
price 0 1.00 5 9 0 253 0
weekly_price 7833 0.13 6 9 0 215 0
monthly_price 8053 0.10 7 9 0 221 0
security_deposit 3516 0.61 5 9 0 109 0
cleaning_fee 2561 0.72 5 7 0 83 0
extra_people 0 1.00 5 7 0 54 0
calendar_updated 0 1.00 5 13 0 77 0
cancellation_policy 0 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-16 4
host_since 4 1.00 2008-08-28 2020-06-14 2015-08-12 2577
calendar_last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-16 4
first_review 1609 0.82 2010-10-15 2020-06-14 2018-06-22 2040
last_review 1609 0.82 2013-03-06 2020-06-16 2020-01-04 1320

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 8986 0 NaN :
medium_url 8986 0 NaN :
xl_picture_url 8986 0 NaN :
host_is_superhost 4 1 0.18 FAL: 7400, TRU: 1582
host_has_profile_pic 4 1 1.00 TRU: 8958, FAL: 24
host_identity_verified 4 1 0.32 FAL: 6114, TRU: 2868
neighbourhood_group_cleansed 8986 0 NaN :
is_location_exact 0 1 0.83 TRU: 7487, FAL: 1499
square_feet 8960 0 0.00 FAL: 26
has_availability 0 1 1.00 TRU: 8986
requires_license 0 1 0.00 FAL: 8986
license 8986 0 NaN :
jurisdiction_names 8986 0 NaN :
instant_bookable 0 1 0.45 FAL: 4954, TRU: 4032
is_business_travel_ready 0 1 0.00 FAL: 8986
require_guest_profile_picture 0 1 0.00 FAL: 8953, TRU: 33
require_guest_phone_verification 0 1 0.01 FAL: 8891, TRU: 95

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.45e+07 1.30e+07 2.35e+03 1.35e+07 2.50e+07 3.67e+07 4.38e+07 <U+2585><U+2585><U+2585><U+2585><U+2587>
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 8.47e+07 9.43e+07 2.58e+03 1.54e+07 4.08e+07 1.31e+08 3.50e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
host_listings_count 4 1.00 1.19e+01 6.47e+01 0.00e+00 1.00e+00 1.00e+00 3.00e+00 7.58e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 4 1.00 1.19e+01 6.47e+01 0.00e+00 1.00e+00 1.00e+00 3.00e+00 7.58e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 195 0.98 1.06e+03 7.04e+01 6.32e+02 1.00e+03 1.05e+03 1.07e+03 3.09e+03 <U+2587><U+2582><U+2581><U+2581><U+2581>
latitude 0 1.00 5.08e+01 2.00e-02 5.08e+01 5.08e+01 5.08e+01 5.09e+01 5.09e+01 <U+2581><U+2582><U+2587><U+2583><U+2581>
longitude 0 1.00 4.36e+00 3.00e-02 4.26e+00 4.35e+00 4.36e+00 4.38e+00 4.49e+00 <U+2581><U+2586><U+2587><U+2582><U+2581>
accommodates 0 1.00 3.01e+00 1.73e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bathrooms 4 1.00 1.18e+00 5.70e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 30 1.00 1.22e+00 8.50e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 99 0.99 1.71e+00 1.29e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 3.10e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.49e+00 1.02e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 7.40e+00 2.92e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 7.27e+02 5.34e+02 1.00e+00 3.50e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_minimum_nights 0 1.00 7.30e+00 2.91e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 0 1.00 7.66e+00 2.94e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 0 1.00 8.04e+02 5.10e+02 1.00e+00 9.90e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 0 1.00 8.14e+02 5.05e+02 1.00e+00 1.20e+02 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 0 1.00 7.49e+00 2.92e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 0 1.00 8.09e+02 5.06e+02 1.00e+00 1.20e+02 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 1.21e+01 1.30e+01 0.00e+00 0.00e+00 5.00e+00 2.80e+01 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2585>
availability_60 0 1.00 2.61e+01 2.62e+01 0.00e+00 0.00e+00 1.60e+01 5.70e+01 6.00e+01 <U+2587><U+2581><U+2581><U+2582><U+2586>
availability_90 0 1.00 4.07e+01 3.92e+01 0.00e+00 0.00e+00 3.60e+01 8.60e+01 9.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2586>
availability_365 0 1.00 1.32e+02 1.41e+02 0.00e+00 0.00e+00 8.70e+01 2.73e+02 3.65e+02 <U+2587><U+2582><U+2582><U+2581><U+2583>
number_of_reviews 0 1.00 3.00e+01 6.14e+01 0.00e+00 1.00e+00 7.00e+00 2.80e+01 7.66e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 9.29e+00 1.67e+01 0.00e+00 0.00e+00 2.00e+00 1.00e+01 1.43e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 1758 0.80 9.26e+01 8.85e+00 2.00e+01 9.00e+01 9.50e+01 9.90e+01 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 1766 0.80 9.55e+00 8.70e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 1764 0.80 9.22e+00 1.09e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2582><U+2587>
review_scores_checkin 1766 0.80 9.67e+00 7.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 1764 0.80 9.64e+00 8.00e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 1768 0.80 9.51e+00 7.90e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 1767 0.80 9.25e+00 9.50e-01 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 5.94e+00 1.44e+01 1.00e+00 1.00e+00 1.00e+00 3.00e+00 8.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 4.64e+00 1.36e+01 0.00e+00 0.00e+00 1.00e+00 2.00e+00 8.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 1.18e+00 4.76e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 5.30e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 5.00e-02 7.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 1609 0.82 1.31e+00 1.72e+00 1.00e-02 2.10e-01 6.00e-01 1.65e+00 1.33e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

Overall, there are 106 variables and 8986 observations in the initial data. Among all of those, 39 variables are stored as numbers, including the IDs, host listings counts, review scores ratings,number of reviews, number of rooms and beds, accommodates, number of guests, longitude, latitude etc.

What’s more, we could get the categorical factors from the results above, such as the experiences_offered, host_location, host_is_superhost, host_has_profile_pic, host_identity_verified, neighborhood_cleansed, city, state, market, smart_location, country, is_location exact, property_type, room_type, bed_type, has_availability, requires_license, instant_bookable, is_business_travel_ready, require_guest_profile_picture, require_guest_phone_verification etc. Basically, these are the factors related to the host, the location and other requirements.

1.2 Data wrangling

We notice that all the value for price is stored as characters rather than numbers. Therefore, we change the those values into numbers.

#Data wrangling

#change the price values from strings to numbers
listings <- listings %>% 
  mutate(price = parse_number(price),
         weekly_price=parse_number(weekly_price),
         monthly_price=parse_number(monthly_price),
         security_deposit=parse_number(security_deposit),
         cleaning_fee = parse_number(cleaning_fee),
         extra_people=parse_number(extra_people))

skimr::skim(listings)
Data summary
Name listings
Number of rows 8986
Number of columns 106
_______________________
Column type frequency:
character 39
Date 5
logical 17
numeric 45
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 8986 0
name 12 1.00 1 217 0 8815 0
summary 409 0.95 1 1000 0 8056 0
space 3216 0.64 1 1000 0 5374 0
description 291 0.97 1 1000 0 8431 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 3416 0.62 1 1000 0 4758 0
notes 6163 0.31 1 1000 0 2291 0
transit 3219 0.64 1 1000 0 4950 0
access 4864 0.46 1 1000 0 3416 0
interaction 4336 0.52 1 1000 0 3797 0
house_rules 4852 0.46 1 1000 0 3323 0
picture_url 0 1.00 81 146 0 8796 0
host_url 0 1.00 38 43 0 6233 0
host_name 4 1.00 1 28 0 2888 0
host_location 30 1.00 2 70 0 616 0
host_about 4714 0.48 1 6003 0 2683 11
host_response_time 4 1.00 3 18 0 5 0
host_response_rate 4 1.00 2 4 0 44 0
host_acceptance_rate 4 1.00 2 4 0 99 0
host_thumbnail_url 4 1.00 55 106 0 6211 0
host_picture_url 4 1.00 57 109 0 6211 0
host_neighbourhood 1814 0.80 5 30 0 88 0
host_verifications 0 1.00 2 170 0 210 0
street 0 1.00 11 71 0 191 0
neighbourhood 0 1.00 5 21 0 49 0
neighbourhood_cleansed 0 1.00 5 21 0 19 0
city 71 0.99 3 22 0 65 0
state 104 0.99 4 30 0 42 0
market 12 1.00 5 21 0 6 0
smart_location 0 1.00 11 39 0 78 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
property_type 0 1.00 4 22 0 26 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1552 0 8421 0
calendar_updated 0 1.00 5 13 0 77 0
cancellation_policy 0 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-16 4
host_since 4 1.00 2008-08-28 2020-06-14 2015-08-12 2577
calendar_last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-16 4
first_review 1609 0.82 2010-10-15 2020-06-14 2018-06-22 2040
last_review 1609 0.82 2013-03-06 2020-06-16 2020-01-04 1320

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 8986 0 NaN :
medium_url 8986 0 NaN :
xl_picture_url 8986 0 NaN :
host_is_superhost 4 1 0.18 FAL: 7400, TRU: 1582
host_has_profile_pic 4 1 1.00 TRU: 8958, FAL: 24
host_identity_verified 4 1 0.32 FAL: 6114, TRU: 2868
neighbourhood_group_cleansed 8986 0 NaN :
is_location_exact 0 1 0.83 TRU: 7487, FAL: 1499
square_feet 8960 0 0.00 FAL: 26
has_availability 0 1 1.00 TRU: 8986
requires_license 0 1 0.00 FAL: 8986
license 8986 0 NaN :
jurisdiction_names 8986 0 NaN :
instant_bookable 0 1 0.45 FAL: 4954, TRU: 4032
is_business_travel_ready 0 1 0.00 FAL: 8986
require_guest_profile_picture 0 1 0.00 FAL: 8953, TRU: 33
require_guest_phone_verification 0 1 0.01 FAL: 8891, TRU: 95

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.45e+07 1.30e+07 2.35e+03 1.35e+07 2.50e+07 3.67e+07 4.38e+07 <U+2585><U+2585><U+2585><U+2585><U+2587>
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 8.47e+07 9.43e+07 2.58e+03 1.54e+07 4.08e+07 1.31e+08 3.50e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
host_listings_count 4 1.00 1.19e+01 6.47e+01 0.00e+00 1.00e+00 1.00e+00 3.00e+00 7.58e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 4 1.00 1.19e+01 6.47e+01 0.00e+00 1.00e+00 1.00e+00 3.00e+00 7.58e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 195 0.98 1.06e+03 7.04e+01 6.32e+02 1.00e+03 1.05e+03 1.07e+03 3.09e+03 <U+2587><U+2582><U+2581><U+2581><U+2581>
latitude 0 1.00 5.08e+01 2.00e-02 5.08e+01 5.08e+01 5.08e+01 5.09e+01 5.09e+01 <U+2581><U+2582><U+2587><U+2583><U+2581>
longitude 0 1.00 4.36e+00 3.00e-02 4.26e+00 4.35e+00 4.36e+00 4.38e+00 4.49e+00 <U+2581><U+2586><U+2587><U+2582><U+2581>
accommodates 0 1.00 3.01e+00 1.73e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bathrooms 4 1.00 1.18e+00 5.70e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 30 1.00 1.22e+00 8.50e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 99 0.99 1.71e+00 1.29e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 3.10e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
price 0 1.00 7.79e+01 1.53e+02 0.00e+00 4.00e+01 6.00e+01 8.50e+01 8.94e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
weekly_price 7833 0.13 4.18e+02 2.82e+02 7.00e+01 2.50e+02 3.50e+02 5.00e+02 3.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
monthly_price 8053 0.10 1.38e+03 1.03e+03 3.15e+02 7.50e+02 1.10e+03 1.65e+03 9.95e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 3516 0.61 2.15e+02 4.30e+02 0.00e+00 0.00e+00 1.00e+02 2.50e+02 4.74e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 2561 0.72 2.76e+01 2.66e+01 0.00e+00 1.00e+01 2.00e+01 4.00e+01 5.05e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.49e+00 1.02e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1.00 7.35e+00 1.20e+01 0.00e+00 0.00e+00 0.00e+00 1.20e+01 2.69e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 7.40e+00 2.92e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 7.27e+02 5.34e+02 1.00e+00 3.50e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_minimum_nights 0 1.00 7.30e+00 2.91e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 0 1.00 7.66e+00 2.94e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 0 1.00 8.04e+02 5.10e+02 1.00e+00 9.90e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 0 1.00 8.14e+02 5.05e+02 1.00e+00 1.20e+02 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 0 1.00 7.49e+00 2.92e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 0 1.00 8.09e+02 5.06e+02 1.00e+00 1.20e+02 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 1.21e+01 1.30e+01 0.00e+00 0.00e+00 5.00e+00 2.80e+01 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2585>
availability_60 0 1.00 2.61e+01 2.62e+01 0.00e+00 0.00e+00 1.60e+01 5.70e+01 6.00e+01 <U+2587><U+2581><U+2581><U+2582><U+2586>
availability_90 0 1.00 4.07e+01 3.92e+01 0.00e+00 0.00e+00 3.60e+01 8.60e+01 9.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2586>
availability_365 0 1.00 1.32e+02 1.41e+02 0.00e+00 0.00e+00 8.70e+01 2.73e+02 3.65e+02 <U+2587><U+2582><U+2582><U+2581><U+2583>
number_of_reviews 0 1.00 3.00e+01 6.14e+01 0.00e+00 1.00e+00 7.00e+00 2.80e+01 7.66e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 9.29e+00 1.67e+01 0.00e+00 0.00e+00 2.00e+00 1.00e+01 1.43e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 1758 0.80 9.26e+01 8.85e+00 2.00e+01 9.00e+01 9.50e+01 9.90e+01 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 1766 0.80 9.55e+00 8.70e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 1764 0.80 9.22e+00 1.09e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2582><U+2587>
review_scores_checkin 1766 0.80 9.67e+00 7.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 1764 0.80 9.64e+00 8.00e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 1768 0.80 9.51e+00 7.90e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 1767 0.80 9.25e+00 9.50e-01 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 5.94e+00 1.44e+01 1.00e+00 1.00e+00 1.00e+00 3.00e+00 8.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 4.64e+00 1.36e+01 0.00e+00 0.00e+00 1.00e+00 2.00e+00 8.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 1.18e+00 4.76e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 5.30e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 5.00e-02 7.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 1609 0.82 1.31e+00 1.72e+00 1.00e-02 2.10e-01 6.00e-01 1.65e+00 1.33e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

1.3 Select the variables and handling NAs

We select the variables we believe are the most relevant to our analysis from the perspective of hosts, location, property and reviews, and then filter out those could only accommodate 1 because we have 2 persons.

listings <- listings%>%
  
  # we select the variables we believe are the most relevant to our analysis.
  select(#id
         listing_url,
          
         #hosts
         host_id,
         host_since,
         host_response_rate,
         host_response_time,
         host_acceptance_rate,
         host_is_superhost,
         host_has_profile_pic,
         host_identity_verified,
         
         #location
         neighbourhood_cleansed,
         latitude,
         longitude,
         is_location_exact,
         neighbourhood,
         
         #property
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         price,
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         
         #reviews
         number_of_reviews,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_checkin,
         review_scores_cleanliness,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         cancellation_policy)%>%
  
  #filter for those accommodate more than 1
  filter(accommodates>1)

skim(listings)
Data summary
Name listings
Number of rows 8203
Number of columns 36
_______________________
Column type frequency:
character 9
Date 1
logical 4
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1 33 37 0 8203 0
host_response_rate 4 1 2 4 0 44 0
host_response_time 4 1 3 18 0 5 0
host_acceptance_rate 4 1 2 4 0 98 0
neighbourhood_cleansed 0 1 5 21 0 19 0
neighbourhood 0 1 5 21 0 49 0
property_type 0 1 4 22 0 22 0
room_type 0 1 10 15 0 4 0
cancellation_policy 0 1 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 4 1 2008-08-28 2020-06-14 2015-08-13 2521

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 4 1 0.18 FAL: 6754, TRU: 1445
host_has_profile_pic 4 1 1.00 TRU: 8179, FAL: 20
host_identity_verified 4 1 0.32 FAL: 5584, TRU: 2615
is_location_exact 0 1 0.84 TRU: 6863, FAL: 1340

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_id 0 1.00 8.44e+07 9.42e+07 2582.00 1.53e+07 4.11e+07 1.30e+08 3.50e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
latitude 0 1.00 5.08e+01 2.00e-02 50.77 5.08e+01 5.08e+01 5.09e+01 5.09e+01 <U+2581><U+2582><U+2587><U+2583><U+2581>
longitude 0 1.00 4.36e+00 3.00e-02 4.26 4.35e+00 4.36e+00 4.38e+00 4.49e+00 <U+2581><U+2586><U+2587><U+2582><U+2581>
accommodates 0 1.00 3.21e+00 1.69e+00 2.00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bathrooms 2 1.00 1.18e+00 5.70e-01 0.00 1.00e+00 1.00e+00 1.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 25 1.00 1.24e+00 8.80e-01 0.00 1.00e+00 1.00e+00 2.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 70 0.99 1.77e+00 1.32e+00 0.00 1.00e+00 1.00e+00 2.00e+00 3.10e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
price 0 1.00 8.02e+01 1.51e+02 0.00 4.50e+01 6.00e+01 8.90e+01 8.94e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 3102 0.62 2.17e+02 4.36e+02 0.00 0.00e+00 1.00e+02 2.50e+02 4.74e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 2211 0.73 2.82e+01 2.69e+01 0.00 1.00e+01 2.00e+01 4.00e+01 5.05e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.53e+00 1.05e+00 1.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1.00 7.57e+00 1.18e+01 0.00 0.00e+00 0.00e+00 1.50e+01 2.69e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 6.74e+00 2.90e+01 1.00 1.00e+00 2.00e+00 3.00e+00 1.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 7.36e+02 5.34e+02 1.00 3.55e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1.00 3.12e+01 6.27e+01 0.00 1.00e+00 8.00e+00 2.90e+01 7.66e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 1507 0.82 9.27e+01 8.67e+00 20.00 9.00e+01 9.50e+01 9.90e+01 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 1515 0.82 9.55e+00 8.60e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 1515 0.82 9.67e+00 7.50e-01 2.00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 1513 0.82 9.23e+00 1.07e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2582><U+2587>
review_scores_communication 1513 0.82 9.65e+00 7.80e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 1517 0.82 9.51e+00 7.80e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 1516 0.82 9.26e+00 9.30e-01 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>

As we can see from the results above, there are 2211 missing values in cleaning_fee and 3102 missing values in Security_deposit, which indicates that the cleaning fee and the security deposit are not applicable in those cases. Therefore, we could change the NAs into 0 as follows.

#handling NAs in cleaning fees and security deposit
listings <- listings %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee),
    security_deposit = case_when(
    is.na(security_deposit) ~ 0, 
    TRUE ~ cleaning_fee))

Also, we found that there are some observations with missing values in bedroom and bathroom. It could be something like a studio with no extra bedroom or bathroom in the property. Therefore, we change those NAs into 0 as well.

#handling NAs in bedrooms and bathrooms
listings <- listings %>%
  mutate(bedrooms = case_when(
    is.na(bedrooms) ~ 0, 
    TRUE ~ bedrooms),
    bathrooms = case_when(
    is.na(bathrooms) ~ 0, 
    TRUE ~ bathrooms))

As for those missing values in host_since and the reviews, we just filter those things out.

#filter out those with NAs in host_since and the review data
listings <- listings %>%
  drop_na(host_since,
          review_scores_rating,
          review_scores_accuracy,
          review_scores_checkin,
          review_scores_cleanliness,
          review_scores_communication,
          review_scores_location,
          review_scores_value)

After that, we also simplified the property types into 5 categories. 4892 of the properties are apartments, which is 73.22% of the total listings in Brussels. There are 529 houses and 314 condominiums on Airbnb as well. In this case, we keep the top 4 types of properties—apartment,house,condominium,townhouse–and classify the other types by a new category called “Others” to construct the simplified property type as follows.

#find the top 4 property types
top_4_p<- listings%>%
  count(property_type)%>%
  arrange(desc(n))%>%
  mutate(proportion=n/sum(n))
top_4_p
property_typenproportion
Apartment48920.732   
House5290.0792  
Condominium3140.047   
Townhouse2650.0397  
Loft2510.0376  
Serviced apartment1970.0295  
Bed and breakfast900.0135  
Guesthouse400.00599 
Guest suite300.00449 
Hotel260.00389 
Boutique hotel150.00225 
Villa90.00135 
Aparthotel70.00105 
Other70.00105 
Cottage20.000299
Nature lodge20.000299
Casa particular (Cuba)10.00015 
Earth house10.00015 
Farm stay10.00015 
Hostel10.00015 
Tiny house10.00015 
#apply simplified property types into 5 categories
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","House", "Condominium","Townhouse") ~ property_type, 
    TRUE ~ "Other"))

category_5_p<-listings%>%
  count(prop_type_simplified)%>%
  mutate(prop_type_simplified=factor(prop_type_simplified,order=TRUE,levels = c("Apartment","House", "Condominium","Townhouse","Other")),
         proportion=n/sum(n))
  
ggplot(category_5_p,aes(x=prop_type_simplified,y=n))+
  geom_col(fill="Pink")+
  labs(title="Simplified property type distribution",
       subtitle="Simplified property types on Airbnb in Brussels",
       x="Simplified Property Type",
       y="Numbers")+
  theme_bw()+
  NULL

As for the minimum nights, Airbnb is most commonly used for travel purposes, and we only want to include listings in our regression analysis that are intended for travel purposes. Therefore, we took a look at the most common values for minimum nights in Brussels.

#find the most common values for minimum_night
min_n <- listings%>%
  count(minimum_nights)%>%
  arrange(desc(n))%>%
  mutate(proportion=n/sum(n))

top10_min_n<-min_n%>%
  head(10)

#The most common values for minimum_night
ggplot(top10_min_n,aes(x=reorder(minimum_nights,-n),y=n))+
  geom_col(fill="pink")+
  labs(title="Most common minimum-nights requirements",
       subtitle="Top 10 minimum-nights required to book a property on Airbnb in Brussels",
       x="Value of minimum nights",
       y="Numbers")+
  theme_bw()+
  NULL

#overall density
ggplot(min_n,aes(x=minimum_nights))+
  geom_density(fill="pink",color="red")+
  labs(title="Minimum nights distribution",
       subtitle="minimum nights required to book a property on Airbnb in Brussels",
       x="Minimum nights",
       y="Density")+
  theme_bw()+
  NULL

The most common values for the variable minimum_nights are 1 and 2 days, the majority of them are within 7 days. Among the top 10 most common values, we could see that the minimum required nights of 30 days and 90 days are somehow more common than that of 6 days, which is much longer than what we expected for travel purposes. This probably suggests that Airbnb plays the same role as the housing agents that provide short-rent houses in Brussels. The overall distribution is right skewed as well, which shows the same conclusion.

Since we are going to estimate the cost of 2 people staying for 4 nights, we filter the Airbnb data so that it only includes observations with a minimum nights less than 4 and get rid of the outliers. Then we could get the data that we are going to use in the regression model.

# filter the airbnb data 
listings <- listings%>%
  filter(minimum_nights<=4)

1.4 Statisical summary of variables that we are interested in

skim(listings)
Data summary
Name listings
Number of rows 5670
Number of columns 37
_______________________
Column type frequency:
character 10
Date 1
logical 4
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1 33 37 0 5670 0
host_response_rate 0 1 2 4 0 41 0
host_response_time 0 1 3 18 0 5 0
host_acceptance_rate 0 1 2 4 0 96 0
neighbourhood_cleansed 0 1 5 21 0 19 0
neighbourhood 0 1 5 21 0 49 0
property_type 0 1 4 18 0 20 0
room_type 0 1 10 15 0 4 0
cancellation_policy 0 1 8 27 0 4 0
prop_type_simplified 0 1 5 11 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-08-28 2020-05-25 2015-08-29 2174

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.21 FAL: 4488, TRU: 1182
host_has_profile_pic 0 1 1.00 TRU: 5658, FAL: 12
host_identity_verified 0 1 0.31 FAL: 3933, TRU: 1737
is_location_exact 0 1 0.85 TRU: 4811, FAL: 859

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_id 0 1 8.54e+07 9.28e+07 2582.00 1.58e+07 4.27e+07 1.33e+08 3.48e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
latitude 0 1 5.08e+01 2.00e-02 50.77 5.08e+01 5.08e+01 5.09e+01 5.09e+01 <U+2581><U+2582><U+2587><U+2583><U+2581>
longitude 0 1 4.36e+00 3.00e-02 4.26 4.34e+00 4.36e+00 4.38e+00 4.47e+00 <U+2581><U+2583><U+2587><U+2582><U+2581>
accommodates 0 1 3.25e+00 1.76e+00 2.00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bathrooms 0 1 1.17e+00 4.60e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 0 1 1.22e+00 8.10e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 25 1 1.80e+00 1.33e+00 0.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
price 0 1 7.69e+01 7.89e+01 0.00 4.50e+01 6.00e+01 8.90e+01 3.50e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 0 1 1.77e+01 2.42e+01 0.00 0.00e+00 1.00e+01 3.00e+01 5.05e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 0 1 2.08e+01 2.40e+01 0.00 0.00e+00 1.50e+01 3.00e+01 5.05e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1 1.61e+00 1.10e+00 1.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1 8.53e+00 1.20e+01 0.00 0.00e+00 5.00e+00 1.50e+01 2.50e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1 1.90e+00 8.80e-01 1.00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 <U+2587><U+2587><U+2581><U+2583><U+2581>
maximum_nights 0 1 7.27e+02 5.34e+02 1.00 3.10e+01 1.12e+03 1.12e+03 1.11e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1 4.20e+01 7.13e+01 1.00 5.00e+00 1.40e+01 4.50e+01 7.66e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 0 1 9.27e+01 8.45e+00 20.00 9.00e+01 9.50e+01 9.80e+01 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 0 1 9.55e+00 8.40e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 0 1 9.68e+00 7.30e-01 2.00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 0 1 9.23e+00 1.06e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2582><U+2587>
review_scores_communication 0 1 9.66e+00 7.50e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 0 1 9.52e+00 7.80e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 0 1 9.26e+00 9.00e-01 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>

And we are going to explore the data from the perspectives of host, location, property, reviews and cancellation policy separately.

1.4.1 Host

First of all, we would like to explore how long have the hosts been on Airbnb.

#we could get the latest date of all the scrape in checking the listings_csv file and found it to be "2020-06-19"

#caculate the year of host experience
year_host_experience<- listings%>%
  mutate(year_host=as.numeric(ymd("2020-06-19")-host_since)/365)

#statistical summary of host experience
year_host<-favstats(~year_host, data=year_host_experience)

year_host%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0.068 3.04 4.81 6.06 11.8 4.59 2.19 5670 0
#plot the distribution of host experience
ggplot(year_host_experience,aes(x=year_host))+
  geom_density(fill="pink",color="red")+
  labs(title="The majority of the hosts had more than 3 years of experience on Airbnb",
       subtitle="in Brussels",
       x="Years of host experience",
       y="Density")+
  theme_bw()+
  NULL

Overall, the hosts had an average of 4.6 years of experience in hosting on Airbnb. It is slightly left-skewed with a median of 4.81. The majority of hosts had an experience over 3 years.

Then we look at the response time of the hosts. Although this variable is not applicable to 49.9% of the Airbnb hosts, 28.57% of them will response within one hour. And only less than 4% of the hosts have a response time longer than one day, which suggests that the quality of Airbnb hosts in Brussels are high.

#distribution of host response time
listings<-listings%>%
   mutate(host_response_time=factor(host_response_time,order=TRUE,levels=c("within an hour", "within a few hours", "within a day","a few days or more")))

#summary of the response time
res_time <-listings%>%
  group_by(host_response_time)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num))

res_time%>%
  kbl() %>%
  kable_styling()
host_response_time num proportion
within an hour 1620 0.286
within a few hours 565 0.100
within a day 434 0.077
a few days or more 220 0.039
NA 2831 0.499
#plot the distribution of host response time
ggplot(res_time,aes(x=host_response_time,y=num))+
  geom_col(fill="pink")+
  labs(title="Distribution of host response time",
       x="Host response time",
       y="Numbers")+
  theme_bw()+
  NULL

After that, we checked the relationship between host years and host response time to see if those who response quick would have longer host experience on Airbnb. Yet the result didn’t reflect this relationship.

#reorder the host response time in year host experience
year_host_experience<-year_host_experience%>%
    mutate(host_response_time=factor(host_response_time,order=TRUE,levels=c("within an hour", "within a few hours", "within a day","a few days or more")))

#check the relationship between response time and host years
ggplot(year_host_experience,aes(x=host_response_time, y=year_host)) +
  geom_boxplot(fill="pink",color="red") +
  labs(title="Quick response doesn't mean longer years of host experience",
       x="Host response time",
       y="Years of host")+
  theme_bw()+
  NULL

Then we take a look at another factor that could reflected the quality of hosts–whether the host is a super host–and check whether there’s any correlation between super hosts and others. We found that 20.8% of the hosts are super hosts on Airbnb in Brussels. Still we cannot say that super hosts would response faster than other hosts.

#summary of super hosts
super_hosts<- listings%>%
  group_by(host_is_superhost,host_response_time)%>%
  summarise(num=n())%>%
  mutate(proportion_within_the_group=num/sum(num))

super_hosts%>%
  kbl() %>%
  kable_styling()
host_is_superhost host_response_time num proportion_within_the_group
FALSE within an hour 1162 0.259
FALSE within a few hours 391 0.087
FALSE within a day 294 0.066
FALSE a few days or more 189 0.042
FALSE NA 2452 0.546
TRUE within an hour 458 0.387
TRUE within a few hours 174 0.147
TRUE within a day 140 0.118
TRUE a few days or more 31 0.026
TRUE NA 379 0.321
#plot the distribution of host response time
ggplot(super_hosts,aes(x=reorder(host_is_superhost,num),y=num))+
  geom_col(fill="pink")+
  labs(title="There are 1182 super hosts on Airbnb in Brussels",
       x="Host is a super host",
       y="Numbers")+
  theme_bw()+
  NULL

#Plot to see the relationship between super hosts and other hosts
ggplot(listings,aes(x=host_response_time, y=count(host_response_time))) +
  geom_col(fill="pink") +
  facet_wrap(~host_is_superhost,ncol=1)+
  labs(title="Super hosts didn't seem to respond more quickly than other hosts",
       x="Host response time",
       y="Numbers")+
  theme_bw()+
  NULL

1.4.2 Location

Next we take a look at the locations of property. According to the data, there are 19 neighborhoods covered by the listings on Airbnb. And 1891 listings are in Bruxelles, which is 33% of the total listings in Brussels. The top 3 neighborhoods are Bruxelles, Ixelles and Saint-Gilles.

#summary of the neighborhood covered by Airbnb listings in Brussels
su_locations<-listings%>%
  group_by(neighbourhood_cleansed)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num))%>%
  arrange(desc(num))

su_locations%>%
  kbl() %>%
  kable_styling()
neighbourhood_cleansed num proportion
Bruxelles 1891 0.334
Ixelles 933 0.165
Saint-Gilles 622 0.110
Schaerbeek 372 0.066
Etterbeek 280 0.049
Forest 268 0.047
Anderlecht 259 0.046
Uccle 248 0.044
Molenbeek-Saint-Jean 149 0.026
Saint-Josse-ten-Noode 121 0.021
Woluwe-Saint-Lambert 120 0.021
Auderghem 86 0.015
Woluwe-Saint-Pierre 75 0.013
Jette 73 0.013
Watermael-Boitsfort 56 0.010
Evere 39 0.007
Koekelberg 32 0.006
Berchem-Sainte-Agathe 25 0.004
Ganshoren 21 0.004
#plot the number of listings in different neighborhoods
ggplot(su_locations,aes(x=num,y=reorder(neighbourhood_cleansed,num)))+
  geom_col(fill="pink")+
  labs(title="Airbnb listings in different neighbourhoods in Brussels",
       x="Number of listings",
       y=NULL)+
  theme_bw()+
  NULL

And then we take a look at the accuracy of the locations shown on Airbnb. Almost 85% of the listings show the exact same location as it is in Brussels, which suggests that the level of accuracy is quite high.

#summary of the location accuracy
location_accu<-listings%>%
  group_by(is_location_exact)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num))%>%
  arrange(desc(num))

location_accu%>%
  kbl() %>%
  kable_styling()
is_location_exact num proportion
TRUE 4811 0.849
FALSE 859 0.151
#plot the accuracy of locations
ggplot(location_accu,aes(x=reorder(is_location_exact,-num),y=num))+
  geom_col(fill="pink")+
  labs(title="The accuracy of the location shown on Airbnb listings in Brussels",
       x="Is the location exact the same as shown?",
       y="Number of listings")+
  theme_bw()+
  NULL

1.4.3 Property

In this part, we look at the property type, room type and other features.

# summary of simplified property types after filters
category_5_p2<-listings%>%
  count(prop_type_simplified)%>%
  mutate(prop_type_simplified=factor(prop_type_simplified,order=TRUE,levels = c("Apartment","House", "Condominium","Townhouse","Other")),
         proportion=n/sum(n))

#plot the number of listings in different property types
ggplot(category_5_p2,aes(x=prop_type_simplified,y=n))+
  geom_col(fill="Pink")+
  labs(title="The majority of Airbnb listings in Brussels are apartments",
       x="Simplified proper types",
       y="Number of listings")+
  theme_bw()+
  NULL

Then we take a look at the room type. 69% of the listings are entire home/apt and 29% of the listings are private rooms. There are only a few hotel rooms and shared rooms, which shows a high level of overall privacy. Besides

#summary of room types
r_type1<-listings%>%
  group_by(room_type)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num))%>%
  arrange(desc(num))

r_type1%>%
  kbl() %>%
  kable_styling()
room_type num proportion
Entire home/apt 3914 0.690
Private room 1667 0.294
Hotel room 59 0.010
Shared room 30 0.005
#plot the distribution of rooms types
ggplot(r_type1,aes(x=num, y=reorder(room_type,num))) +
  geom_col(fill="pink") +
  labs(title="Airbnb listings in Brussels has a high level of privacy",
       subtitle = "distribution of different room types",
       x="Number of listings",
       y=NULL)+
  theme_bw()+
  NULL

Also we could see the distribution of bedrooms and bathrooms among the Airbnb listings in Brussels. The most common value for both bedrooms and bathrooms is 1.

#bedrooms
#statistical summary of bedrooms
bed_r<-favstats(~bedrooms, data=listings)

bed_r%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 1 1 1 10 1.22 0.808 5670 0
#plot the distribution of bedrooms
ggplot(listings,aes(x=bedrooms))+
  geom_density(fill="pink",color="red")+
  labs(title="The majority of the listings have one bedroom",
       x="Number of bedrooms",
       y="Density")+
  scale_x_continuous(breaks = seq(0,10,1), limits = c(0,10))+
  theme_bw()+
  NULL

#bathrooms
#statistical summary of bathrooms
bath_r<-favstats(~bathrooms, data=listings)

bath_r%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 1 1 1 11 1.17 0.461 5670 0
#plot the distribution of bathrooms
ggplot(listings,aes(x=bathrooms))+
  geom_density(fill="pink",color="red")+
  labs(title="The majority of the listings have one bathroom",
       x="Number of bathrooms",
       y="Density")+
  scale_x_continuous(breaks = seq(0,10,1), limits = c(0,10))+
  theme_bw()+
  NULL

The average accommodates is 3.25 persons, and over 50% of listings accommodates for exact 2 people. Besides, those could accommodate even number of people are more than those accommodates odd number of people.

#accommodates
#statistical summary of accommodates
acco_n<-favstats(~accommodates, data=listings)

acco_n%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
2 2 2 4 16 3.25 1.76 5670 0
#plot the distribution of accommodates
ggplot(listings,aes(x=accommodates))+
  geom_density(fill="pink",color="red")+
  labs(title="Over 50% of listings accommodates 2 people",
       x="Accommodates",
       y="Density")+
  scale_x_continuous(breaks = seq(0,16,2), limits = c(0,16))+
  theme_bw()+
  NULL

Overall, the average price is 76.9 dollars and the median price is 60 dollars. But the maximum price reached 3500 dollars, which led to a positive skewed price distribution.

#price
#statistical summary of price
summary_price<-favstats(~price, data=listings)

summary_price%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 45 60 89 3500 76.9 78.9 5670 0
#plot the distribution of price
ggplot(listings,aes(x=price))+
  geom_density(fill="pink",color="red")+
  labs(title="Distribution of price is skewed right",
       x="Price",
       y="Density")+
  theme_bw()+
    scale_x_continuous(breaks = seq(0,1000,50), limits = c(0,1000))+
  NULL

1.4.4 Reviews

The average number of reviews is 42, while the median is only 14, which suggests a right-skewed distribution. Meanwhile, the maximum number of reviews is 766, which is far beyond the average.

#statistical summary of number_of_reviews
su_review_numbers<-favstats(~number_of_reviews, data=listings)

su_review_numbers%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
1 5 14 45 766 42 71.3 5670 0
#plot the distribution of number_of_reviews
ggplot(listings,aes(x=number_of_reviews))+
  geom_density(fill="pink",color="red")+
  labs(title="Distribution of number of reviews is skewed right",
       x="numbers of reviews",
       y="Density")+
  theme_bw()+
    scale_x_continuous(breaks = seq(0,800,100), limits = c(0,800))+
  NULL

The average rating score is 92.7 and the median rating score is 95, which suggests high satisfaction from customers. The distribution of rating scores is skewed left.

#statistical summary of reviews_scores_rating
su_review_rating<-favstats(~review_scores_rating, data=listings)

su_review_rating%>%
  kbl() %>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
20 90 95 98 100 92.7 8.45 5670 0
#plot the distribution of reviews_scores_rating
ggplot(listings,aes(x=review_scores_rating))+
  geom_density(fill="pink",color="red")+
  labs(title="Distribution of rating scores is skewed left",
       x="Rating scores",
       y="Density")+
  theme_bw()+
    scale_x_continuous(breaks = seq(0,100,25), limits = c(0,100))+
  NULL

To break down the rating scores, we also take a look at the scores on different aspects.

#plot the distribution of different rating scores
# plot the distribution of scores on accuracy
p_accuracy <- listings%>%
  ggplot(aes(x=review_scores_accuracy)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle("Accuracy") +
  theme_bw()+
  NULL


# plot the distribution of scores on checkin process
p_checkin <- listings%>%
  ggplot(aes(x=review_scores_checkin)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle(" Checkin") +
  theme_bw()+
  NULL


# plot the distribution of scores on cleanliness
p_cleanliness <- listings%>%
  ggplot(aes(x=review_scores_cleanliness)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle("Cleanliness") +
  theme_bw()+
  NULL

# plot the distribution of scores on communication
p_communication <- listings%>%
  ggplot(aes(x=review_scores_communication)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle("Communication") +
  theme_bw()+
  NULL

# plot the distribution of scores on location
p_location <- listings%>%
  ggplot(aes(x=review_scores_location)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle("Location") +
  theme_bw()+
  NULL

# plot the distribution of scores on value
p_value <- listings%>%
  ggplot(aes(x=review_scores_value)) +
  geom_density(fill="pink",color="red") +
  labs(x = "review score") +
  ggtitle("Value") +
  theme_bw()+
  NULL

#put the plots together
library(patchwork)
combine1<- p_accuracy+p_cleanliness+p_checkin+p_communication+p_location+p_value
combine1

1.4.5 Cancellation_policy

Basically, the cancellation policy in Airbnb listings in Brussels is proportioned evenly. About 37% of the listings have a flexible cancellation policy, and around 31% of the listings have a moderate cancellation policy and the rest have a strict policy.

#summary of the cancellation policies
su_cancel<-listings%>%
  group_by(cancellation_policy)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num),
         cancellation_policy=factor(cancellation_policy,ordered = TRUE,levels = c("flexible","moderate","strict_14_with_grace_period","super_strict_30")))

su_cancel%>%
  kbl() %>%
  kable_styling()
cancellation_policy num proportion
flexible 2136 0.377
moderate 1750 0.309
strict_14_with_grace_period 1781 0.314
super_strict_30 3 0.001
#plot the number of listings in different cancellation policies
ggplot(su_cancel,aes(x=cancellation_policy,y=num))+
  geom_col(fill="pink")+
  labs(title="Airbnb listings with different cancellation policies in Brussels",
       x="Cancellation policy",
       y="Number of listings")+
  theme_bw()+
  NULL

1.4.6 Compute the dependent variable: Price_4_nights

we create the price_4_nights variable using the case_when() function. Our conditional statement says that if the number of guests included in the price is greater than or equal to 2, the price is just the cleaning fee plus 4* the nightly rate, since the 2 people will stay for 4 nights. However, if the number of guests included in the price is less than 2, meaning that a party of 2 would have to pay for exactly 1 extra guest, then the price will be 4* the nightly price, which is the base price plus the price for that extra guest, plus the cleaning fee.

The average cost of 2 people staying for 4 nights is 339 dollars and the overall distribution is right-skewed.

#compute the price_4_nights variable 
year_host_experience <- year_host_experience %>% 
  mutate(price_4_nights = case_when(
    guests_included >= 2 ~ cleaning_fee + 4*price,
    TRUE ~ 4 * (price + extra_people) +cleaning_fee))

#see the summary of price_4_nights
kbl(skim(year_host_experience$price_4_nights))%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric data 0 1 339 322 36 200 280 380 14000 <U+2587><U+2581><U+2581><U+2581><U+2581>
ggplot(year_host_experience,aes(x=price_4_nights))+
  geom_density(fill="pink",color="red")+
  labs(title="Distribution of price for 2 persons staying 4 nights",
       x="Price for 4 nights",
       y="Density")+
  theme_bw()+
  scale_x_continuous(breaks = seq(0,1000,500), limits = c(0,1000))+
  NULL

1.5 Correlation between variables

To take a look at how the numerical variables are correlated, we use the “ggpair()” function and get the following graphs.

correlation1<- year_host_experience%>%
  #reviews
  select(review_scores_rating, 
         review_scores_accuracy, 
         review_scores_cleanliness, 
         review_scores_checkin, 
         review_scores_communication, 
         review_scores_location, 
         review_scores_value
         ) %>% 
  ggpairs(alpha=0.5) +
  theme_bw()

correlation1

The review scores are highly correlated to each other with correlation coefficient over 0.4, so we should avoid using them all, and maybe just use one variable as the representative of the reviews.

correlation2<- year_host_experience%>%
  #property features and price
  select(accommodates,
         bathrooms,
         bedrooms,
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         price_4_nights) %>% 
  ggpairs(alpha=0.5) +
  theme_bw()

correlation2

Based on the above correlation matrix, the number of bedrooms,bathrooms and accommodates are highly correlated with correlation factors over 0.4. Therefore, we would avoid using all three of them in our regression model. Also, we found that the cleaning fee and security deposit are highly correlated to each other, and the correlation is almost linear according to the scatter plot above. Other variables didn’t show obvious linear correlations, therefore we’d like to group some of the variables and see if there’s any correlations between the factors. We started from the potential independent variables.

First, we would like to start from the reviews and explore the relationship between number of reviews and score ratings. We found that the median rating score goes down as the quartile group goes up. However, the average score in bucket 4 is higher than that in bucket 1. After testing the difference of average rating score in bucket 1 and bucket 4, we found that the difference in mean is statistically significant at 95% level. Besides, because the number is highly right-skewed, the sample size is much bigger in quartile group 4 and the result has smaller standard deviation as well. But there’s no linear correlation between these 2 variables.

#Relationship between number of reviews and the score ratings

##select the data
reviewData <- year_host_experience %>% 
  select(review_scores_rating,number_of_reviews)
glimpse(reviewData)
## Rows: 5,670
## Columns: 2
## $ review_scores_rating <dbl> 87, 80, 100, 98, 90, 93, 97, 87, 91, 92, 93, 1...
## $ number_of_reviews    <dbl> 15, 2, 3, 12, 4, 59, 6, 48, 25, 33, 3, 1, 45, ...
kbl(skim(reviewData))%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric review_scores_rating 0 1 92.7 8.45 20 90 95 98 100 <U+2581><U+2581><U+2581><U+2581><U+2587>
numeric number_of_reviews 0 1 42.0 71.31 1 5 14 45 766 <U+2587><U+2581><U+2581><U+2581><U+2581>
##We group the number of reviews by its quartile
reviewData <- reviewData %>%
  mutate(buckets = ntile(number_of_reviews,4))

#check the quartile of number of reviews
reviewData2 <- reviewData %>%
  group_by(buckets)%>%
  summarize(minimum=min(number_of_reviews),maximum=max(number_of_reviews))
reviewData2%>%
  kbl()%>%
  kable_styling()
buckets minimum maximum
1 1 5
2 5 14
3 14 45
4 45 766
#change buckets into factors
reviewData$buckets<-as.factor(reviewData$buckets)

#plot rating scores in different groups 
ggplot(reviewData,aes(x=review_scores_rating,y=buckets)) + 
  geom_boxplot(fill="pink",color="red")+
  labs(title="Rating scores are more stable in those with higher number of reviews",
       x="Rating scores",
       y="Quartiles group of review numbers")+
  theme_bw()+
  NULL

# test the difference in bucket 1 and bucket 4 
redata<-reviewData%>%
  filter(buckets!="2",
         buckets!="3")
t.test(review_scores_rating ~ buckets, data=redata)
## 
##  Welch Two Sample t-test
## 
## data:  review_scores_rating by buckets
## t = -4, df = 1743, p-value = 3e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.298 -0.831
## sample estimates:
## mean in group 1 mean in group 4 
##            91.7            93.2

Then we would like to explore the relationship between hosts’ experience and the review scores. Also we tried to form 4 quartile groups based on the host’s experience. The average score of bucket 4 is slightly higher than that of bucket 1. However, there isn’t any linear correlation between these factors as well.

#Relationship between score ratings and year of hosts' experience

##select the data
rating_host_y <- year_host_experience %>% 
  select(review_scores_rating,year_host)
glimpse(rating_host_y)
## Rows: 5,670
## Columns: 2
## $ review_scores_rating <dbl> 87, 80, 100, 98, 90, 93, 97, 87, 91, 92, 93, 1...
## $ year_host            <dbl> 11.82, 11.82, 9.86, 9.39, 9.28, 9.28, 9.28, 9....
kbl(skim(rating_host_y))%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric review_scores_rating 0 1 92.72 8.45 20.000 90.00 95.00 98.00 100.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
numeric year_host 0 1 4.59 2.19 0.068 3.04 4.81 6.06 11.8 <U+2585><U+2586><U+2587><U+2582><U+2581>
##We group the year of hosts by its quartile
rating_host_y <- rating_host_y %>%
  mutate(buckets = ntile(year_host,4))

#check the quartile of years of hosts
rating_host_y2 <- rating_host_y %>%
  group_by(buckets)%>%
  summarize(minimum=min(year_host),maximum=max(year_host))
rating_host_y2%>%
  kbl()%>%
  kable_styling()
buckets minimum maximum
1 0.068 3.04
2 3.038 4.81
3 4.811 6.06
4 6.060 11.82
#change buckets into factors
rating_host_y$buckets<-as.factor(rating_host_y$buckets)

#plot rating scores in different groups
ggplot(rating_host_y,aes(x=review_scores_rating,y=buckets)) + 
  geom_boxplot(fill="pink",color="red")+
  labs(title="Rating scores didn't vary a lot between quartile group of host years",
       x="Rating scores",
       y="Quartiles group of host years")+
  theme_bw()+
  NULL

# test the difference in bucket 1 and bucket 4 
rating_host<-rating_host_y%>%
  filter(buckets!="2",
         buckets!="3")
t.test(review_scores_rating ~ buckets, data=rating_host)
## 
##  Welch Two Sample t-test
## 
## data:  review_scores_rating by buckets
## t = -6, df = 2314, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.60 -1.36
## sample estimates:
## mean in group 1 mean in group 4 
##            91.7            93.7

Then we check the relationship between neighborhood and the reviews. No obvious linear correlations are found, yet the standard deviation of the number of reviews and the rating score is slightly different between different neighborhood.

#neighborhood vs. number of reviews
ggplot(listings,aes(x=number_of_reviews,y=reorder(neighbourhood_cleansed,number_of_reviews,median)))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Number of reviews on listings in different neighbourhoods in Brussels",
       x="Number of reviews",
       y=NULL)+
  theme_bw()+
  NULL

#neighborhood vs. rating scores
ggplot(listings,aes(x=review_scores_rating, y=reorder(neighbourhood_cleansed,review_scores_rating),median))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Rating scores on listings in different neighbourhoods in Brussels",
       x="Rating scores",
       y=NULL)+
  theme_bw()+
  NULL

Then we check the relationship between simplified property types and the reviews.

#simplified property type vs. number of reviews
ggplot(listings,aes(x=number_of_reviews, y=reorder(prop_type_simplified,number_of_reviews)))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Number of reviews didn't vary much across different property types",
       x="Number of reviews",
       y=NULL)+
  theme_bw()+
  NULL

#simplified property type vs. rating scores
ggplot(listings,aes(x=review_scores_rating, y=reorder(prop_type_simplified,review_scores_rating)))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Rating scores didn't vary much across different property types",
       x="Rating scores",
       y=NULL)+
  theme_bw()+
  NULL

What’s more, we examine the relationship between simplified property types and the neighborhood. Apartments are the most common property type in almost every neighborhoods in Brussels.

prop_type1<-listings%>%
  group_by(neighbourhood_cleansed,prop_type_simplified)%>%
  summarise(num=n())%>%
  mutate(proportion=num/sum(num))%>%
  arrange(desc(num))

prop_type1%>%
  kbl() %>%
  kable_styling()
neighbourhood_cleansed prop_type_simplified num proportion
Bruxelles Apartment 1375 0.727
Ixelles Apartment 758 0.812
Saint-Gilles Apartment 483 0.777
Bruxelles Other 283 0.150
Schaerbeek Apartment 254 0.683
Forest Apartment 210 0.784
Etterbeek Apartment 209 0.746
Anderlecht Apartment 165 0.637
Uccle Apartment 140 0.565
Bruxelles Condominium 98 0.052
Saint-Josse-ten-Noode Apartment 98 0.810
Molenbeek-Saint-Jean Apartment 97 0.651
Woluwe-Saint-Lambert Apartment 92 0.767
Bruxelles House 77 0.041
Uccle House 62 0.250
Ixelles House 61 0.065
Anderlecht Other 60 0.232
Bruxelles Townhouse 58 0.031
Ixelles Other 54 0.058
Jette Apartment 50 0.685
Auderghem Apartment 45 0.523
Saint-Gilles House 45 0.072
Saint-Gilles Other 45 0.072
Woluwe-Saint-Pierre Apartment 43 0.573
Schaerbeek Other 41 0.110
Schaerbeek House 37 0.099
Saint-Gilles Townhouse 34 0.055
Ixelles Townhouse 32 0.034
Molenbeek-Saint-Jean Other 29 0.195
Ixelles Condominium 28 0.030
Etterbeek House 27 0.096
Evere Apartment 25 0.641
Koekelberg Apartment 25 0.781
Forest House 23 0.086
Auderghem House 22 0.256
Schaerbeek Townhouse 22 0.059
Etterbeek Other 21 0.075
Watermael-Boitsfort Apartment 21 0.375
Forest Other 20 0.075
Schaerbeek Condominium 18 0.048
Watermael-Boitsfort House 18 0.321
Anderlecht Condominium 17 0.066
Anderlecht House 16 0.062
Uccle Condominium 16 0.065
Uccle Townhouse 16 0.065
Woluwe-Saint-Pierre House 16 0.213
Saint-Gilles Condominium 15 0.024
Ganshoren Apartment 14 0.667
Uccle Other 14 0.056
Etterbeek Townhouse 13 0.046
Woluwe-Saint-Lambert House 13 0.108
Berchem-Sainte-Agathe Apartment 12 0.480
Jette House 12 0.164
Watermael-Boitsfort Other 12 0.214
Molenbeek-Saint-Jean Condominium 11 0.074
Etterbeek Condominium 10 0.036
Forest Townhouse 9 0.034
Saint-Josse-ten-Noode House 9 0.074
Auderghem Other 8 0.093
Auderghem Townhouse 8 0.093
Saint-Josse-ten-Noode Other 8 0.066
Molenbeek-Saint-Jean House 7 0.047
Woluwe-Saint-Pierre Condominium 7 0.093
Evere Condominium 6 0.154
Forest Condominium 6 0.022
Woluwe-Saint-Lambert Townhouse 6 0.050
Woluwe-Saint-Pierre Other 6 0.080
Berchem-Sainte-Agathe House 5 0.200
Jette Condominium 5 0.068
Molenbeek-Saint-Jean Townhouse 5 0.034
Woluwe-Saint-Lambert Other 5 0.042
Berchem-Sainte-Agathe Other 4 0.160
Evere House 4 0.103
Evere Other 4 0.103
Saint-Josse-ten-Noode Condominium 4 0.033
Watermael-Boitsfort Townhouse 4 0.071
Woluwe-Saint-Lambert Condominium 4 0.033
Auderghem Condominium 3 0.035
Berchem-Sainte-Agathe Townhouse 3 0.120
Ganshoren House 3 0.143
Jette Other 3 0.041
Jette Townhouse 3 0.041
Koekelberg Townhouse 3 0.094
Woluwe-Saint-Pierre Townhouse 3 0.040
Ganshoren Condominium 2 0.095
Koekelberg Other 2 0.062
Saint-Josse-ten-Noode Townhouse 2 0.017
Anderlecht Townhouse 1 0.004
Berchem-Sainte-Agathe Condominium 1 0.040
Ganshoren Other 1 0.048
Ganshoren Townhouse 1 0.048
Koekelberg Condominium 1 0.031
Koekelberg House 1 0.031
Watermael-Boitsfort Condominium 1 0.018
#plot the distribution of rooms types
ggplot(prop_type1,aes(x=num,y=reorder(neighbourhood_cleansed,num),fill=prop_type_simplified)) +
  geom_bar(position="dodge",stat="identity") +
  labs(title="Distribution of different property types",
       subtitle = "across different neighbourhoods in Brussels",
       x="Number of listings",
       y=NULL)+
  theme_bw()+
  theme(legend.title = element_blank())+
  NULL

1.5.1 Correlation between price_4_nights and other variables in particular

Since we have to study how the other variable could impact the total cost of staying, we decided to explore the correlation between price_4_nights and the other potential variables.

First we start with the cost and the property types. Apart from the “Other” category, the average costs to stay in a house is 365 dollars, while that for an apartment is 322 dollars. However, we found that the average price of apartment is not statistically significant than that of houses after testing the difference.

#compute summary of price of different property types
sum_price_property<-year_host_experience%>%
  group_by(prop_type_simplified)%>%
  summarise(avg_price4=mean(price_4_nights),
            median_price4=median(price_4_nights),
            sd_p4=sd(price_4_nights),
            max_p4=max(price_4_nights),
            min_p4=max(price_4_nights))

#show the summary
sum_price_property%>%
  kbl()%>%
  kable_styling()
prop_type_simplified avg_price4 median_price4 sd_p4 max_p4 min_p4
Apartment 322 274 239 4120 4120
Condominium 339 303 174 1200 1200
House 365 235 703 14000 14000
Other 443 360 394 4910 4910
Townhouse 306 220 255 1930 1930
#plot the box plot based on simplified property type
ggplot(year_host_experience,aes(x=prop_type_simplified,y=price_4_nights))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Price for 4 nights in different property types",
       x=NULL,
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
  theme_bw()+
  NULL

# test the difference in apartment and house
test_prop_price<-year_host_experience%>%
  filter(prop_type_simplified %in% c("Apartment","House"))

t.test(price_4_nights ~ prop_type_simplified, data=test_prop_price)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by prop_type_simplified
## t = -1, df = 469, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -107.5   22.5
## sample estimates:
## mean in group Apartment     mean in group House 
##                     322                     365

We also examined the relationship between price and room types. After testing the difference between average price in each group, we found that hotel rooms are more expensive than the entire home or apartment, and the difference is statistically significant at 95% level. Meanwhile, the average price of private room is higher than that of shared room, and the difference is statistically significant at 90% level.

#compute summary of price of different room types
sum_price_room<-year_host_experience%>%
  group_by(room_type)%>%
  summarise(avg_price4=mean(price_4_nights),
            median_price4=median(price_4_nights),
            sd_p4=sd(price_4_nights),
            max_p4=max(price_4_nights),
            min_p4=max(price_4_nights))

#show the summary
sum_price_room%>%
  kbl()%>%
  kable_styling()
room_type avg_price4 median_price4 sd_p4 max_p4 min_p4
Entire home/apt 380 320 246 4120 4120
Hotel room 662 448 686 4910 4910
Private room 235 189 419 14000 14000
Shared room 197 172 116 667 667
#plot the box plot based on simplified property type
ggplot(year_host_experience,aes(x=room_type,y=price_4_nights))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Price for 4 nights varies in different room types",
       x=NULL,
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
  theme_bw()+
  NULL

# test the difference in hotel room and entire home/apt
test_room_price1<-year_host_experience%>%
  filter(room_type %in% c("Entire home/apt","Hotel room"))
t.test(price_4_nights ~ room_type, data=test_room_price1)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by room_type
## t = -3, df = 58, p-value = 0.003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -461 -103
## sample estimates:
## mean in group Entire home/apt      mean in group Hotel room 
##                           380                           662
# test the difference in private room and shared home
test_room_price1<-year_host_experience%>%
  filter(room_type %in% c("Private room","Shared room"))
t.test(price_4_nights ~ room_type, data=test_room_price1)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by room_type
## t = 2, df = 44, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.95 86.08
## sample estimates:
## mean in group Private room  mean in group Shared room 
##                        235                        197

Then we started with the cost and the reviews by creating scatter plots. We found that the price didn’t goes up as the number of reviews goes up. Maybe people reviews more on the property with not that high price on Airbnb.

On the other hand, we noticed that the maximum price goes up when the ratings goes up, which somehow suggests that people might tend to pay higher price for those with higher rating.

#examine the relationship between price and number of reviews
ggplot(year_host_experience,aes(x=number_of_reviews,y=price_4_nights,color="red"))+
  geom_point()+
  labs(title="Price for 4 nights change as reviews numbers goes up",
       x=NULL,
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,5000,500),limits = c(0,5000))+
  theme_bw()+
  NULL

#examine the relationship between price and rating scores
ggplot(year_host_experience,aes(x=review_scores_rating,y=price_4_nights,color="red"))+
  geom_point(show.legend = FALSE)+
  labs(title="Maximum price for 4 nights goes up as rating scores goes up",
       x=NULL,
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,5000,500),limits = c(0,5000))+
  theme_bw()+
  NULL

However, the average price didn’t change as the rating scores goes up. We also formed 4 groups based on the quartile of ratings scores and and found the difference between the average costs is not statistically significant.

##We group the rating scores by its quartile
rating_price <- year_host_experience %>%
  mutate(buckets = ntile(review_scores_rating,4))

#check the quartile of rating scores
rating_price2 <- rating_price %>%
  group_by(buckets)%>%
  summarize(minimum=min(review_scores_rating),maximum=max(review_scores_rating))
rating_price2%>%
  kbl()%>%
  kable_styling()
buckets minimum maximum
1 20 90
2 90 95
3 95 98
4 98 100
#change buckets into factors
rating_price$buckets<-as.factor(rating_price$buckets)

#plot price for 4 nights in different groups
ggplot(rating_price,aes(x=price_4_nights,y=buckets)) + 
  geom_boxplot(fill="pink",color="red")+
  labs(title="Average price didn't vary a lot between different groups of rating scores",
       x="Price for 4 nights",
       y="Quartiles group of rating scores")+
  scale_x_continuous(breaks = c(0,1000,500),limits=c(0,1000))+
  theme_bw()+
  NULL

# test the difference in bucket 1 and bucket 4 
rating_price<-rating_price%>%
  filter(buckets!="2",
         buckets!="3")
t.test(price_4_nights ~ buckets, data=rating_price)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by buckets
## t = -1, df = 2821, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -30.75   9.25
## sample estimates:
## mean in group 1 mean in group 4 
##             336             347

Next, we would like to see whether the hosting experience is correlated to the total price. According to the following graphs, we found that the price didn’t vary a lot between experienced hosts and new hosts. Besides, listings with a super host may have a higher average price than others, and the difference is statistically significant at 90% level.

#examine the relationship between price and years of host
ggplot(year_host_experience,aes(x=year_host,y=price_4_nights,color="red"))+
  geom_point()+
  labs(title="Price for 4 nights didn't vary a lot between experienced hosts and new hosts",
       x="years of hosting experience",
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
  theme_bw()+
  NULL

#examine the relationship between price and super hosts
ggplot(year_host_experience,aes(x=host_is_superhost,y=price_4_nights))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Price for 4 nights goes up as rating scores goes up",
       x=NULL,
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
  theme_bw()+
  NULL

#test the significance of the difference
t.test(price_4_nights ~ host_is_superhost, data=year_host_experience)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by host_is_superhost
## t = -2, df = 2689, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.33   3.81
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##                 337                 349

After that, we examine the relationship between price and cancellation policy. From the graphs below, we could see that listings with stricter cancellation policies cost slightly more in average than those with more flexible policies. This could be related to how popular the listings are. Strict cancellation policy might suggest that the listing is more popular in the market, thus the price would be a little bit higher.

#examine the relationship between price and cancellation policies
ggplot(year_host_experience,aes(x=cancellation_policy,y=price_4_nights))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Price for 4 nights goes up as policy becomes stricter",
       x="Cancellation policy",
       y="Price for 4 nights")+
  scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
  theme_bw()+
  NULL

Finally, we examine the correlation between price and location. The median of total costs slightly varied among different neiborhood. Bruxelles has the highest median costs and the difference between its median cost and that of Evere is statistically significant at 95% level.

#price_4_nights in different neighborhood
ggplot(year_host_experience,aes(x=price_4_nights,y=reorder(neighbourhood_cleansed,price_4_nights,median)))+
  geom_boxplot(fill="pink",color="red")+
  labs(title="Distribution of price for 4 nights in different neighbourhoods in Brussels",
       x="Price for 4 nights",
       y=NULL)+
  scale_x_continuous(breaks = c(0,2000,500),limits = c(0,2000))+
  theme_bw()+
  NULL

# test the difference of average price in Bruxelles and Evere
test_neighbor_price<-year_host_experience%>%
  filter(neighbourhood_cleansed %in% c("Bruxelles","Evere"))
t.test(price_4_nights ~ neighbourhood_cleansed, data=test_neighbor_price)
## 
##  Welch Two Sample t-test
## 
## data:  price_4_nights by neighbourhood_cleansed
## t = 7, df = 49, p-value = 2e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   97.5 181.9
## sample estimates:
## mean in group Bruxelles     mean in group Evere 
##                     395                     255

2 Mappings

leaflet(data = filter(year_host_experience, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

#Data used for regression
listings2<- year_host_experience %>%
  select(price,
         cleaning_fee,
         extra_people,
         property_type,
         room_type,
         number_of_reviews,
         review_scores_rating,
         longitude,
         latitude,
         neighbourhood,
         guests_included,
         host_is_superhost,
         beds,
         cancellation_policy, 
         is_location_exact,
         bedrooms,
         bathrooms,
         accommodates,
         neighbourhood_cleansed, 
         prop_type_simplified,
         price_4_nights)

listings3 <- listings2 

#Grouping neighbourhoods
listings3 <- listings3 %>%    
  mutate(neighbourhood_simplified=case_when(neighbourhood_cleansed=="Bruxelles"|neighbourhood_cleansed=="Etterbeek"|neighbourhood_cleansed=="Ixelles"|neighbourhood_cleansed=="Saint-Gilles"|neighbourhood_cleansed=="Saint-Josse-ten-Noode"|neighbourhood_cleansed=="Schaerbeek"~"Central",neighbourhood_cleansed=="Berchem-Sainte-Agathe"|neighbourhood_cleansed=="Evere"|neighbourhood_cleansed=="Ganshoren"|neighbourhood_cleansed=="Koekelberg"|neighbourhood_cleansed=="Jette"~"North",neighbourhood_cleansed=="Uccle"|neighbourhood_cleansed=="Watermael-Boitsfort"~"South",neighbourhood_cleansed=="Anderlecht"|neighbourhood_cleansed=="Forest"|neighbourhood_cleansed=="Molenbeek-Saint-Jean"~"West",neighbourhood_cleansed=="Auderghem"|neighbourhood_cleansed=="Woluwe-Saint-Lambert"|neighbourhood_cleansed=="Woluwe-Saint-Pierre"~"East", TRUE~neighbourhood_cleansed))

#Adding log variable
listings3 <- listings3 %>%
  mutate(log_price_4_nights=log(price_4_nights))
plot2 <- listings3 %>%
  ggplot(aes(x=price_4_nights))+
   geom_histogram(color="blue",fill="tomato2")
plot2

plot3 <- listings3 %>%
  ggplot(aes(x=log_price_4_nights))+
   geom_histogram(color="blue",fill="tomato2")
plot3

The first reason why we will use a log dependent variable (log_price_4_nights) is the interpretation. It is more meaningful to see how each explanatory variable affects the dependent variable in terms of percentages than in the absolute terms. Secondly, as can be seen from the plots above, when we transform the price_4_nights in the log form, it more closely follows a normal distribution while the normal form is skewed to the right. So, it might help us reduce heteroscedasticity of residuals and improve the precision of our estimates.

The goal is to determine how a variety of different variables affects price for 4 nights in an Airbnb accommodation in Bruxelles. We start by making a model where the dependent variable is log_price_4_nights, and the explanatory variables are prop_type_simplified, number_of_reviews, and review_scores_rating. When there is a dependent variable in a log form, coefficients are not interpreted in the standard way. Specifically, if we have a coefficient B1, then that means that a one unit increase in the variable X1 increases the dependent variable by (e^B1-1). But for small enough values of B1, e^B1 - 1 is roughly equal to B1. This is a good approximation when -0.1 < B < 0.1. For values outside of this range, we will use e^B1 - 1 to calculate the effect.

In each regression, we will use vif() to check whether there exists multicollinearity among the explanatory variables. Similarly, we will use autoplot() and see how the model satisfies 4 main OLS assumptions.

#Regression model 1
  model1 <-  lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating, data=listings3)
  table1 <- get_regression_table(model1)
  table1
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept5.37 0.07869.2 0    5.21 5.52 
prop_type_simplifiedCondominium0.0850.0342.470.0130.0180.152
prop_type_simplifiedHouse-0.0750.026-2.880.004-0.126-0.024
prop_type_simplifiedOther0.2850.02312.5 0    0.2410.33 
prop_type_simplifiedTownhouse-0.1170.036-3.230.001-0.188-0.046
number_of_reviews0    0    -1.130.2580    0    
review_scores_rating0.0030.0013.520    0.0010.005
  vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.01  4               1
## number_of_reviews    1.00  1               1
## review_scores_rating 1.00  1               1
  autoplot(model1)

With the prop type simplified, we have 4 dummy variables where the omitted group is “apartment”, so coefficients need to be interpreted in relation with the apartment group. For instance, on the prop_type_simplifiedHouse the coefficient is -0.075. This means that on average holding other things constant, houses are roughly 7.5% cheaper than the apartments. Similarly, townhouses are roughly 11% cheaper than apartments, and the accommodations belonging to the “other group” are around 33% more expensive. Condominiums are correlated with 8.5% higher prices than in the apartments. In those 4 cases, the coefficients are significant as p-values are very low. Furthermore, the coefficient on review_scores_rating is 0.003, meaning that a unit increase in this variable, holding everything else constant, increases prices by roughly 0.3%. This statistic is significant. Number_of_reviews seems to have no effect on prices as the coefficient is ~0.

There seems to be no collinearity among explanatory variables as VIF values are all significantly lower than 5. In this model, R-sqaured is very low and there is a heteroscedasticity of residuals, which is a violation of one of the main assumptions of linear regressions. Although heteroscedasticity does not lead to bias in the estimates, it might make them less precise and lead to incorrect standard errors. Hence, in the subsequent models we will add new variables. In addition, even though the above-mentioned coefficients seem to have effect on the dependent variable, this does not necessarily imply causation. When variables which are correlated with both some of the explanatory variables and with the dependent variable are not included in the model, coefficients on some variables cause ‘omitted variables bias’. For instance, it might not necessarily be the case that the type of the accommodation being a house automatically leads to lower prices, rather it might be that the houses are generally located in cheaper regions than apartments, or that the people renting houses take only one room whereas people renting apartments usually take the whole place. That is also why in the next models, we control for variables which might cause this bias (confounding variables).

#Regression model2, with room_type
  
model2 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings3)
  table2 <- get_regression_table(model2)
  table2
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept5.49 0.06881   0    5.36 5.62 
prop_type_simplifiedCondominium0.0720.03 2.4 0.0170.0130.13 
prop_type_simplifiedHouse0.1830.0247.780    0.1370.229
prop_type_simplifiedOther0.3170.02115.3 0    0.2770.358
prop_type_simplifiedTownhouse0.1850.0325.710    0.1220.249
number_of_reviews0    0    -2.230.0260    0    
review_scores_rating0.0030.0014.320    0.0020.005
room_typeHotel room0.1630.0632.580.01 0.0390.287
room_typePrivate room-0.5890.014-41.3 0    -0.617-0.561
room_typeShared room-0.7  0.085-8.260    -0.866-0.534
  vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.24  4            1.03
## number_of_reviews    1.01  1            1.00
## review_scores_rating 1.00  1            1.00
## room_type            1.23  3            1.04
  autoplot(model2)

In this regression, we add variable room_type, which can take 4 values: entire house, shared room, private room, and hotel room. WE created 3 dummies for hotel, private and shared rooms. Hence, the omitted group is entire house and the coefficients on these dummies must be interpreted with respect to this omitted group. Firstly, the coefficient on the “hotel room” dummy, is 0.163, meaning that, ceteris paribus, hotel rooms are on average 17.7% more expensive than entire houses for 4 nights. On the private room dummy, we get a coefficient of -0.589, when transformed, tells us that private rooms are roughly 45% cheaper than entire houses, holding everything else constant. The largest gap is with shared rooms, which seem to be around 50% cheaper than the entire houses. All those statistics are significant even at 1% level. It is interesting to note that for instance, the coefficient on prop_type_simplifiedHouse completely changed, and is now positive. It means that it was negatively biased in the first regression, and it was capturing some of the effect effect of the room_type.

#Adding new explanatory variables - bathrooms, bedrooms, and beds
  
  model3 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds, data=listings3)
  table3 <- get_regression_table(model3)
  table3
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept5.02 0.06380.2  0    4.9  5.14 
prop_type_simplifiedCondominium0.0670.0272.52 0.0120.0150.12 
prop_type_simplifiedHouse-0.0030.022-0.1580.874-0.0460.039
prop_type_simplifiedOther0.2360.01912.6  0    0.1990.273
prop_type_simplifiedTownhouse0.0190.0290.6310.528-0.0390.076
number_of_reviews0    0    -2.23 0.0260    0    
review_scores_rating0.0040.0015.36 0    0.0020.005
room_typeHotel room0.2470.0574.36 0    0.1360.358
room_typePrivate room-0.4570.013-34.1  0    -0.483-0.43 
room_typeShared room-0.6230.08 -7.8  0    -0.779-0.466
bathrooms0.1020.0147.34 0    0.0750.129
bedrooms0.1440.01 14.8  0    0.1250.163
beds0.07 0.00612.2  0    0.0590.082
  vif(model3)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.34  4            1.04
## number_of_reviews    1.01  1            1.01
## review_scores_rating 1.01  1            1.00
## room_type            1.36  3            1.05
## bathrooms            1.37  1            1.17
## bedrooms             2.07  1            1.44
## beds                 1.98  1            1.41
  autoplot(model3)

In this model, WE added variables ‘bathrooms’, ‘bedrooms’, and ‘beds’. The coefficients on all three new variables are very significant (p-values are very low) and they have a significant effect on prices, as expected. For instance, the coefficient on bedrooms is 0.144, which means that a unit increase in bedroom leads to roughly a 15.5% increase in prices. Similarly, the coefficients for beds and bathrooms are 0.07 and 0.1, respectively. Like in the previous model, the coefficients on the house and townhouse dummies changed, and they are now insignificant.

   model4 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates, data=listings3)
  table4 <- get_regression_table(model4)
  table4
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept4.82 0.06278.4  0    4.7  4.94 
prop_type_simplifiedCondominium0.0710.0262.73 0.0060.02 0.122
prop_type_simplifiedHouse0    0.021-0.0170.986-0.0420.041
prop_type_simplifiedOther0.2170.01811.9  0    0.1810.253
prop_type_simplifiedTownhouse0.0170.0280.6030.547-0.0390.073
number_of_reviews0    0    -2.88 0.0040    0    
review_scores_rating0.0040.0016.96 0    0.0030.006
room_typeHotel room0.3040.0555.54 0    0.1960.412
room_typePrivate room-0.3860.013-28.6  0    -0.412-0.359
room_typeShared room-0.4980.078-6.42 0    -0.65 -0.346
bathrooms0.0650.0144.76 0    0.0380.091
bedrooms0.0820.01 8.23 0    0.0630.102
beds-0.0120.007-1.74 0.082-0.0260.002
accommodates0.1120.00619.3  0    0.1  0.123
  vif(model4)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.34  4            1.04
## number_of_reviews    1.01  1            1.01
## review_scores_rating 1.02  1            1.01
## room_type            1.48  3            1.07
## bathrooms            1.40  1            1.18
## bedrooms             2.31  1            1.52
## beds                 3.14  1            1.77
## accommodates         3.69  1            1.92
  autoplot(model4)

In the model 4, we add variable ‘accommodates’ which signifies the size of the accommodation. The coefficient on the variable ‘accommodates’ is 0.112, meaning that a unit increase increases price by around 11.85%, holding other variables constant. At this point, there is still no high multicollinearity among the variables, as Variance Inflation Factor (VIf) is not higher than 5 for any variable.

  model5 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact, data=listings3)
  table5 <- get_regression_table(model5)
  table5
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept4.91 0.06477.2  0    4.78 5.03 
prop_type_simplifiedCondominium0.0680.0262.63 0.0090.0170.118
prop_type_simplifiedHouse-0.0020.021-0.0910.927-0.0430.039
prop_type_simplifiedOther0.2180.01812    0    0.1820.253
prop_type_simplifiedTownhouse0.0030.0280.1180.906-0.0520.059
number_of_reviews0    0    -4.58 0    0    0    
review_scores_rating0.0030.0014.7  0    0.0020.004
room_typeHotel room0.2950.0555.42 0    0.1880.402
room_typePrivate room-0.3830.013-28.6  0    -0.41 -0.357
room_typeShared room-0.4820.077-6.24 0    -0.633-0.33 
bathrooms0.0620.0144.55 0    0.0350.088
bedrooms0.0810.01 8.17 0    0.0620.1  
beds-0.0120.007-1.7  0.09 -0.0260.002
accommodates0.1130.00619.7  0    0.1020.125
host_is_superhostTRUE0.11 0.0147.98 0    0.0830.137
is_location_exactTRUE0.0270.0151.82 0.068-0.0020.056
  vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.35  4            1.04
## number_of_reviews    1.06  1            1.03
## review_scores_rating 1.09  1            1.04
## room_type            1.48  3            1.07
## bathrooms            1.40  1            1.18
## bedrooms             2.31  1            1.52
## beds                 3.14  1            1.77
## accommodates         3.70  1            1.92
## host_is_superhost    1.13  1            1.06
## is_location_exact    1.01  1            1.00
  autoplot(model5)

We can also check whether the facts that a host is a superhost and whether the location is exact lead to higher prices when controlling for other variables. The coefficient on the superhost dummy is 0.11 meaning that being a superhost is correlated with roughly 11% higher prices, ceteris paribus. This coefficient is significant at 1 % level. On the other hand, the coefficient on the exact location dummy is positive and equals 0.027, however it is not significant at the 5% level. Furthermore, the VIF remains the highest on ‘accommodates’ but it is still lower than 5.

  model6 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact+cancellation_policy, data=listings3)
  table6 <- get_regression_table(model6)
  table6
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept4.88 0.06377    0    4.75 5    
prop_type_simplifiedCondominium0.0640.0262.48 0.0130.0130.114
prop_type_simplifiedHouse-0.0050.021-0.2190.826-0.0450.036
prop_type_simplifiedOther0.2170.01812    0    0.1810.252
prop_type_simplifiedTownhouse0.0050.0280.1790.858-0.05 0.06 
number_of_reviews0    0    -5.57 0    -0.0010    
review_scores_rating0.0030.0014.73 0    0.0020.004
room_typeHotel room0.2970.0565.35 0    0.1880.406
room_typePrivate room-0.3750.013-28.1  0    -0.402-0.349
room_typeShared room-0.4920.077-6.41 0    -0.643-0.342
bathrooms0.06 0.0134.48 0    0.0340.087
bedrooms0.0830.01 8.39 0    0.0630.102
beds-0.0110.007-1.59 0.111-0.0250.003
accommodates0.11 0.00619.2  0    0.0990.122
host_is_superhostTRUE0.1050.0147.61 0    0.0780.132
is_location_exactTRUE0.0290.0151.95 0.0510    0.057
cancellation_policymoderate0.0290.0132.25 0.0240.0040.054
cancellation_policystrict_14_with_grace_period0.1050.0138.1  0    0.08 0.131
cancellation_policysuper_strict_300.0250.2340.1070.915-0.4330.483
  vif(model6)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.36  4            1.04
## number_of_reviews    1.08  1            1.04
## review_scores_rating 1.09  1            1.04
## room_type            1.57  3            1.08
## bathrooms            1.40  1            1.18
## bedrooms             2.31  1            1.52
## beds                 3.14  1            1.77
## accommodates         3.71  1            1.93
## host_is_superhost    1.14  1            1.07
## is_location_exact    1.01  1            1.00
## cancellation_policy  1.12  3            1.02
  autoplot(model6)

In this model we add cancellation policy in the regression and see whether it correlates with the higher prices. Three dummies are created corresponding to strict, super strict and moderate cancellation policy, and the omitted group is ‘flexible’. Compared to flexible, other types of policies seem to correlate with higher prices, however the estimate on the super strict dummy is not significant due to a very low sample size. Of course, this cannot be interpreted as a causal impact, and actually there might be a reverse relationship, since higher quality and hence more expensive apartments might have stricter policies for instance.

   model7 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact+cancellation_policy+neighbourhood_simplified, data=listings3)
  table7 <- get_regression_table(model7)
  table7
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept4.91 0.06378    0    4.79 5.03 
prop_type_simplifiedCondominium0.0750.0252.93 0.0030.0250.124
prop_type_simplifiedHouse0.0080.0210.4050.685-0.0330.049
prop_type_simplifiedOther0.2230.01812.4  0    0.1880.258
prop_type_simplifiedTownhouse0.0030.0280.1110.912-0.0520.058
number_of_reviews0    0    -6.18 0    -0.0010    
review_scores_rating0.0030.0014.78 0    0.0020.004
room_typeHotel room0.2790.0555.06 0    0.1710.387
room_typePrivate room-0.3680.013-27.7  0    -0.394-0.342
room_typeShared room-0.4940.076-6.49 0    -0.643-0.345
bathrooms0.0550.0134.15 0    0.0290.082
bedrooms0.0870.01 8.84 0    0.0670.106
beds-0.0070.007-1.05 0.296-0.0210.006
accommodates0.1080.00618.9  0    0.0970.119
host_is_superhostTRUE0.1060.0147.81 0    0.08 0.133
is_location_exactTRUE0.02 0.0151.38 0.169-0.0090.049
cancellation_policymoderate0.0270.0132.11 0.0350.0020.052
cancellation_policystrict_14_with_grace_period0.1050.0138.19 0    0.08 0.131
cancellation_policysuper_strict_300.0120.2320.0530.958-0.4420.466
neighbourhood_simplifiedEast-0.07 0.024-2.88 0.004-0.118-0.022
neighbourhood_simplifiedNorth-0.1980.029-6.82 0    -0.256-0.141
neighbourhood_simplifiedSouth-0.0730.024-3.08 0.002-0.119-0.026
neighbourhood_simplifiedWest-0.1220.016-7.45 0    -0.154-0.09 
  vif(model7)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.41  4            1.04
## number_of_reviews        1.09  1            1.04
## review_scores_rating     1.09  1            1.04
## room_type                1.58  3            1.08
## bathrooms                1.41  1            1.19
## bedrooms                 2.32  1            1.52
## beds                     3.15  1            1.78
## accommodates             3.73  1            1.93
## host_is_superhost        1.14  1            1.07
## is_location_exact        1.01  1            1.01
## cancellation_policy      1.12  3            1.02
## neighbourhood_simplified 1.09  4            1.01
  autoplot(model7)

Finally, we grouped 19 different neighbourhoods of Bruxelles according to their location in 5 groups: Central, West, North, East, and South. In this model, we will also control for these variables. The omitted group is ‘Central’ and the corresponding coefficients on dummies tell us about the gap between the particular region and the central part. Compared to the central part, other parts of the city correlate with lower prices holding other variables constant and estimates are statistically significant. Omitted variables bias in the previous regressions might be caused by location, as location is correlated with both price and the variables such as number_of_reviews, exact location, rating etc. In addition, we check whether this model satisfies the main LINE assumptions of linear regressions by using autoplot. We can see that there exists a clear linear relationship between price_4_nights and explanatory variables, and that residuals are roughly normally distributed except at the borders (Normal Q-Q plot). In addition, it seems that there is no heteroscedasticity and that variance of the residuals is constant. In addition, there is no multicollinearity. Hence, we will use this model to make a prediction.

  #Overall table
  huxreg(model1,model2,model3,model4,model5,model6,model7,statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma'))
(1)(2)(3)(4)(5)(6)(7)
(Intercept)5.366 ***5.489 ***5.022 ***4.824 ***4.908 ***4.875 ***4.910 ***
(0.078)   (0.068)   (0.063)   (0.062)   (0.064)   (0.063)   (0.063)   
prop_type_simplifiedCondominium0.085 *  0.072 *  0.067 *  0.071 ** 0.068 ** 0.064 *  0.075 ** 
(0.034)   (0.030)   (0.027)   (0.026)   (0.026)   (0.026)   (0.025)   
prop_type_simplifiedHouse-0.075 ** 0.183 ***-0.003    -0.000    -0.002    -0.005    0.008    
(0.026)   (0.024)   (0.022)   (0.021)   (0.021)   (0.021)   (0.021)   
prop_type_simplifiedOther0.285 ***0.317 ***0.236 ***0.217 ***0.218 ***0.217 ***0.223 ***
(0.023)   (0.021)   (0.019)   (0.018)   (0.018)   (0.018)   (0.018)   
prop_type_simplifiedTownhouse-0.117 ** 0.185 ***0.019    0.017    0.003    0.005    0.003    
(0.036)   (0.032)   (0.029)   (0.028)   (0.028)   (0.028)   (0.028)   
number_of_reviews-0.000    -0.000 *  -0.000 *  -0.000 ** -0.000 ***-0.000 ***-0.000 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.003 ***0.003 ***0.004 ***0.004 ***0.003 ***0.003 ***0.003 ***
(0.001)   (0.001)   (0.001)   (0.001)   (0.001)   (0.001)   (0.001)   
room_typeHotel room        0.163 ** 0.247 ***0.304 ***0.295 ***0.297 ***0.279 ***
        (0.063)   (0.057)   (0.055)   (0.055)   (0.056)   (0.055)   
room_typePrivate room        -0.589 ***-0.457 ***-0.386 ***-0.383 ***-0.375 ***-0.368 ***
        (0.014)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   
room_typeShared room        -0.700 ***-0.623 ***-0.498 ***-0.482 ***-0.492 ***-0.494 ***
        (0.085)   (0.080)   (0.078)   (0.077)   (0.077)   (0.076)   
bathrooms                0.102 ***0.065 ***0.062 ***0.060 ***0.055 ***
                (0.014)   (0.014)   (0.014)   (0.013)   (0.013)   
bedrooms                0.144 ***0.082 ***0.081 ***0.083 ***0.087 ***
                (0.010)   (0.010)   (0.010)   (0.010)   (0.010)   
beds                0.070 ***-0.012    -0.012    -0.011    -0.007    
                (0.006)   (0.007)   (0.007)   (0.007)   (0.007)   
accommodates                        0.112 ***0.113 ***0.110 ***0.108 ***
                        (0.006)   (0.006)   (0.006)   (0.006)   
host_is_superhostTRUE                                0.110 ***0.105 ***0.106 ***
                                (0.014)   (0.014)   (0.014)   
is_location_exactTRUE                                0.027    0.029    0.020    
                                (0.015)   (0.015)   (0.015)   
cancellation_policymoderate                                        0.029 *  0.027 *  
                                        (0.013)   (0.013)   
cancellation_policystrict_14_with_grace_period                                        0.105 ***0.105 ***
                                        (0.013)   (0.013)   
cancellation_policysuper_strict_30                                        0.025    0.012    
                                        (0.234)   (0.232)   
neighbourhood_simplifiedEast                                                -0.070 ** 
                                                (0.024)   
neighbourhood_simplifiedNorth                                                -0.198 ***
                                                (0.029)   
neighbourhood_simplifiedSouth                                                -0.073 ** 
                                                (0.024)   
neighbourhood_simplifiedWest                                                -0.122 ***
                                                (0.016)   
#observations5670        5670        5645        5645        5645        5645        5645        
R squared0.035    0.265    0.415    0.451    0.458    0.464    0.474    
Adj. R Squared0.034    0.264    0.413    0.450    0.456    0.462    0.472    
Residual SE0.528    0.461    0.411    0.398    0.396    0.394    0.390    
*** p < 0.001; ** p < 0.01; * p < 0.05.
#Prediction
  
  model_data <- data.frame(prop_type_simplified = "Apartment",number_of_reviews = 10,review_scores_rating = 95,room_type="Private room",bathrooms=1,bedrooms=2,beds=2,host_is_superhost=FALSE,is_location_exact = FALSE, cancellation_policy = "moderate",neighbourhood_simplified = "Central",accommodates=2)
  
predict(model7,newdata=model_data, interval = "prediction")
##    fit  lwr  upr
## 1 5.29 4.52 6.05

Using model7, we try to predict the price of accommodation for 2 persons for 4 nights. We included the data shown in the chunk above. From this, we get that the price prediction is $198 with the wide confidence interval ranging from roughly 91 dollars to 425 dollars.